Future Frame Prediction for Anomaly Detection -- A New Baseline
Anomaly detection in videos refers to the identification of events that do
not conform to expected behavior. However, almost all existing methods tackle
the problem by minimizing the reconstruction errors of training data, which
cannot guarantee a larger reconstruction error for an abnormal event. In this
paper, we propose to tackle the anomaly detection problem within a video
prediction framework. To the best of our knowledge, this is the first work that
leverages the difference between a predicted future frame and its ground truth
to detect an abnormal event. To predict a future frame with higher quality for
normal events, in addition to the commonly used appearance (spatial) constraints on
intensity and gradient, we also introduce a motion (temporal) constraint into
video prediction by enforcing the optical flow between predicted frames and
ground truth frames to be consistent; this is the first work that
introduces a temporal constraint into the video prediction task. Such spatial
and motion constraints facilitate future frame prediction for normal
events, and consequently help identify abnormal events that do
not conform to the expectation. Extensive experiments on both a toy dataset and
some publicly available datasets validate the effectiveness of our method in
terms of robustness to the uncertainty in normal events and the sensitivity to
abnormal events. Comment: IEEE Conference on Computer Vision and Pattern Recognition 2018
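As a rough illustration of how such a prediction-error signal can be turned into a per-frame anomaly score, the sketch below computes PSNR between predicted and ground-truth frames and min-max normalizes it over the video. PSNR-based scoring and per-video normalization are common choices in this line of work; the frame shapes, data ranges, and function names here are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    # Peak signal-to-noise ratio between a predicted frame and its ground truth.
    mse = np.mean((pred - gt) ** 2)
    return 10.0 * np.log10(max_val ** 2 / (mse + 1e-8))

def anomaly_scores(pred_frames, gt_frames):
    # Lower PSNR (larger prediction error) maps to a higher anomaly score.
    # Scores are min-max normalized over the whole test video.
    psnrs = np.array([psnr(p, g) for p, g in zip(pred_frames, gt_frames)])
    normalized = (psnrs - psnrs.min()) / (psnrs.max() - psnrs.min() + 1e-8)
    return 1.0 - normalized

# Toy usage: frames are float arrays in [0, 1]; a well-predicted (normal) frame
# has a small residual, while a poorly predicted (abnormal) frame has a large one.
gt = [np.random.rand(64, 64, 3) for _ in range(8)]
pred = [np.clip(f + 0.01 * np.random.randn(64, 64, 3), 0.0, 1.0) for f in gt]
print(anomaly_scores(pred, gt))
```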
E2E-LOAD: End-to-End Long-form Online Action Detection
Recently, there has been a growing trend toward feature-based approaches for
Online Action Detection (OAD). However, these approaches have limitations due
to their fixed backbone design, which ignores the potential capability of a
trainable backbone. In this paper, we propose the first end-to-end OAD model,
termed E2E-LOAD, designed to address the major challenge of OAD, namely,
long-term understanding and efficient online reasoning. Specifically, our
proposed approach adopts an initial spatial model that is shared by all frames
and maintains a long sequence cache for inference at a low computational cost.
We also advocate an asymmetric spatial-temporal model to handle long-form and
short-form modeling effectively. Furthermore, we propose a novel and efficient
inference mechanism that accelerates heavy spatial-temporal exploration.
Extensive ablation studies and experiments demonstrate the effectiveness and
efficiency of our proposed method. Notably, we achieve 17.3 (+12.6) FPS for
end-to-end OAD with 72.4% (+1.2%), 90.3% (+0.7%), and 48.1% (+26.0%) mAP on
THUMOS14, TVSeries, and HDD, respectively, which is 3x faster than previous
approaches. The source code will be made publicly available.
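To make the idea of a shared per-frame spatial model feeding a long sequence cache concrete, here is a minimal streaming-inference sketch. The class name, parameters, and stand-in encoder are hypothetical; this only illustrates the caching pattern described above, not E2E-LOAD's actual architecture.

```python
from collections import deque
import numpy as np

class OnlineFeatureCache:
    # FIFO cache of per-frame spatial features for streaming (online) inference:
    # each incoming frame is encoded once by a spatial model shared across frames,
    # and the temporal head reasons over the cached long-form sequence without
    # re-encoding past frames.
    def __init__(self, spatial_encoder, max_len=512):
        self.spatial_encoder = spatial_encoder
        self.cache = deque(maxlen=max_len)

    def step(self, frame):
        self.cache.append(self.spatial_encoder(frame))  # encode only the new frame
        return np.stack(self.cache)                     # (t, d) sequence for the temporal model

# Toy usage with a stand-in "encoder" (global average pooling over pixels).
encoder = lambda frame: frame.mean(axis=(0, 1))
cache = OnlineFeatureCache(encoder, max_len=64)
for _ in range(5):
    seq = cache.step(np.random.rand(224, 224, 3))
print(seq.shape)  # (5, 3)
```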
Learning Point-Language Hierarchical Alignment for 3D Visual Grounding
This paper presents a novel hierarchical alignment model (HAM) that learns
multi-granularity visual and linguistic representations in an end-to-end
manner. We extract key points and proposal points to model 3D contexts and
instances, and propose a point-language alignment with context modulation (PLACM)
mechanism, which learns to gradually align word-level and sentence-level
linguistic embeddings with visual representations, while the modulation with
the visual context captures latent informative relationships. To further
capture both global and local relationships, we propose a spatially
multi-granular modeling scheme that applies PLACM to both global and local
fields. Experimental results demonstrate the superiority of HAM, with
visualized results showing that it can dynamically model fine-grained visual
and linguistic representations. HAM outperforms existing methods by a
significant margin and achieves state-of-the-art performance on two publicly
available datasets, and won the championship in the ECCV 2022 ScanRefer challenge.
Code is available at \url{https://github.com/PPjmchen/HAM}. Comment: Champion of the ECCV 2022 ScanRefer Challenge
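As a loose illustration of aligning word-level and sentence-level language embeddings with point features, the sketch below uses simple dot-product attention. It is a generic stand-in for multi-granularity alignment, not the PLACM mechanism itself; all shapes, names, and the scoring scheme are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def align(word_emb, sent_emb, point_feat):
    # Word level: each word attends over point features and gathers a visual context.
    # Sentence level: the pooled point feature is scored against the sentence embedding.
    attn = softmax(word_emb @ point_feat.T / np.sqrt(point_feat.shape[1]))  # (W, P)
    word_aligned = attn @ point_feat                                        # (W, D)
    sentence_score = float(sent_emb @ point_feat.mean(axis=0))              # scalar grounding score
    return word_aligned, sentence_score

words = np.random.rand(6, 32)     # W word embeddings
sentence = np.random.rand(32)     # one sentence embedding
points = np.random.rand(128, 32)  # P key/proposal point features
aligned, score = align(words, sentence, points)
print(aligned.shape, score)
```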
Amodal Segmentation Based on Visible Region Segmentation and Shape Prior
Almost all existing amodal segmentation methods infer occluded regions using
features corresponding to the whole image. This runs counter to human amodal
perception, where humans use the visible part and prior knowledge of the
target's shape to infer the occluded region. To mimic this behavior and resolve
the ambiguity in learning, we propose a framework that first estimates a coarse
visible mask and a coarse amodal mask.
Then based on the coarse prediction, our model infers the amodal mask by
concentrating on the visible region and utilizing the shape prior in the
memory. In this way, features corresponding to background and occlusion can be
suppressed for amodal mask estimation. Consequently, given the same visible
region, the amodal mask is not affected by what the occluder is. Leveraging the
shape prior makes amodal mask estimation more robust and reasonable. Our
proposed model is evaluated on three datasets. Experiments show
that our proposed model outperforms existing state-of-the-art methods. The
visualization of shape prior indicates that the category-specific feature in
the codebook has certain interpretability. Comment: Accepted by AAAI 2021
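A schematic sketch of refining a coarse amodal prediction with a retrieved shape prior while keeping the visible region fixed is given below. The nearest-neighbour retrieval and the simple blending are illustrative assumptions and do not reproduce the paper's memory/codebook design.

```python
import numpy as np

def retrieve_shape_prior(coarse_amodal, codebook):
    # Nearest-neighbour lookup of a category-specific shape prior
    # (an illustrative stand-in for a learned codebook memory).
    dists = np.linalg.norm(codebook.reshape(len(codebook), -1)
                           - coarse_amodal.reshape(1, -1), axis=1)
    return codebook[dists.argmin()]

def refine_amodal(visible_mask, coarse_amodal, codebook, alpha=0.5):
    # Blend the coarse amodal prediction with the retrieved prior, and keep the
    # visible region fixed so the occluder's appearance cannot leak into the mask.
    prior = retrieve_shape_prior(coarse_amodal, codebook)
    fused = alpha * coarse_amodal + (1.0 - alpha) * prior
    return np.maximum(fused, visible_mask)  # the amodal mask always contains the visible part

# Toy usage with three 64x64 soft-mask codebook entries.
codebook = np.random.rand(3, 64, 64)
visible = (np.random.rand(64, 64) > 0.7).astype(float)
coarse = np.clip(visible + 0.3 * np.random.rand(64, 64), 0.0, 1.0)
print(refine_amodal(visible, coarse, codebook).shape)
```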
Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos
Sequential video understanding, as an emerging video understanding task, has
attracted considerable attention from researchers because of its goal-oriented nature. This
paper studies weakly supervised sequential video understanding where the
accurate time-stamp level text-video alignment is not provided. We solve this
task by borrowing ideas from CLIP. Specifically, we use a transformer to
aggregate frame-level features for video representation and use a pre-trained
text encoder to encode the texts corresponding to each action and the whole
video, respectively. To model the correspondence between text and video, we
propose a multiple granularity loss, where the video-paragraph contrastive loss
enforces matching between the whole video and the complete script, and a
fine-grained frame-sentence contrastive loss enforces the matching between each
action and its description. As the frame-sentence correspondence is not
available, we propose to use the fact that video actions happen sequentially in
the temporal domain to generate pseudo frame-sentence correspondence and
supervise the network training with the pseudo labels. Extensive experiments on
video sequence verification and text-to-video matching show that our method
outperforms baselines by a large margin, which validates the effectiveness of
our proposed approach. Code is available at https://github.com/svip-lab/WeakSVR. Comment: CVPR 2023.
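The sketch below illustrates the two ingredients described above: pseudo frame-sentence correspondence obtained purely from the sequential ordering of actions, and a multiple-granularity contrastive objective combining a video-paragraph term with a frame-sentence term. The function names, the equal-length chunking, and the InfoNCE form are assumptions made for illustration, not the paper's exact losses.

```python
import numpy as np

def info_nce(sim, tau=0.07):
    # Contrastive loss over a similarity matrix whose diagonal holds the positives.
    logits = sim / tau
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_p))

def pseudo_alignment(num_frames, num_sentences):
    # Sequential-order prior: actions happen in order, so frames are split into
    # consecutive chunks and the i-th chunk is pseudo-labelled with the i-th sentence.
    bounds = np.linspace(0, num_frames, num_sentences + 1).astype(int)
    labels = np.zeros(num_frames, dtype=int)
    for i in range(num_sentences):
        labels[bounds[i]:bounds[i + 1]] = i
    return labels

def multi_granularity_loss(frame_feat, sent_feat, video_feat, para_feat):
    # frame_feat: (T, D) frames of one video; sent_feat: (S, D) its sentence embeddings;
    # video_feat, para_feat: (B, D) batch-level video and paragraph embeddings.
    labels = pseudo_alignment(len(frame_feat), len(sent_feat))
    pooled = np.stack([frame_feat[labels == i].mean(axis=0) for i in range(len(sent_feat))])
    frame_sentence = info_nce(pooled @ sent_feat.T)       # fine-grained term (pseudo labels)
    video_paragraph = info_nce(video_feat @ para_feat.T)  # coarse-grained term
    return frame_sentence + video_paragraph

# Toy usage with random features.
T, S, B, D = 20, 4, 8, 64
loss = multi_granularity_loss(np.random.rand(T, D), np.random.rand(S, D),
                              np.random.rand(B, D), np.random.rand(B, D))
print(loss)
```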